VAST Challenge 2021 MC2

R
Visualization
Interactive Charts

To visualize and analyze card usage and car movement data related to the employee disappearance incident

LIU Yangguang https://www.linkedin.com/in/ygliu/ (School of Computing and Information Systems, Singapore Management University)
07-23-2021

Background

This study is based on Mini-Challenge 2 of the VAST Challenge 2021. In the fictional scenario, a natural gas company named “GAStech” operates in the island country of Kronos. GAStech has a poor record of environmental stewardship, and after a company IPO celebration in January 2014, several GAStech employees went missing. An environmental organization is suspected in the disappearance.

Many of the Abila, Kronos-based employees of GAStech have company cars which are approved for both personal and business use. Those who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.

Employees with company cars are happy to have these vehicles, because the company cars are generally much higher quality than the cars they would be able to afford otherwise. However, GAStech does not trust their employees. Without the employees’ knowledge, GAStech has installed geospatial tracking software in the company vehicles. The vehicles are tracked periodically as long as they are moving.

This vehicle tracking data has been made available to law enforcement to support their investigation. Unfortunately, data is not available for the day the GAStech employees went missing. Data is only available for the two weeks prior to the disappearance.

To promote local businesses, Kronos based companies provide a Kronos Kares benefit card to GASTech employees giving them discounts and rewards in exchange for collecting information about their credit card purchases and preferences as recorded on loyalty cards. This data has been made available to investigators in the hopes that it can help resolve the situation. However, Kronos Kares does not collect personal information beyond purchases.

Requirement

Use visual analytics to identify which GAStech employees made which purchases and to flag suspicious patterns of behavior. In addition, the study must cope with the uncertainties that result from missing, conflicting, and imperfect data in order to make recommendations for further investigation.

Questions

  1. Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.

  2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.

  3. Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.

  4. Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

  5. Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and explain why. Please limit your response to 10 images and 500 words.

Literature review

The VAST Challenge 2014 used the same scenario with a slightly different dataset and questions. The submission repository can be found here.

Various analytic tools were used across the submissions, such as JMP, D3, and custom tools. Heatmaps and time histograms were useful for representing a numerical value over the combination of one categorical variable and one discrete/categorical variable, such as usage frequency across locations and days. Besides, movement line graphs drawn over a map background can help identify and verify suspicious activities.

However, almost all of these graphs were static, and readers would find it difficult to explore the parts of a graph not specifically discussed by the authors. Since this study is displayed as an HTML page, interactive graphs become possible. For example, tooltips can attach detailed information to every data point, so readers need not check the axes or draw additional graphs. Zoom-in and on-click functions let readers examine a complex graph with many lines/objects and focus on one part at a time.
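As a minimal, generic sketch of this idea (using R’s built-in `mtcars` data rather than the challenge data), any static ggplot object can be wrapped into an interactive chart:

```r
library(ggplot2)
library(plotly)

# A static scatter plot with car names attached as tooltip labels
p <- ggplot(mtcars, aes(x = wt, y = mpg, label = rownames(mtcars))) +
  geom_point()

# ggplotly() adds hover tooltips, zooming and panning for free
ggplotly(p)
```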

Data Preparation

Data Wrangling

Import packages.
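The package list itself is not shown on this page. Based on the functions called throughout the study, a setup along the following lines is assumed (an inference, not the author’s original chunk):

```r
# Packages inferred from the functions used in this study
library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(lubridate)   # mday(), hour()
library(sf)          # st_as_sf(), st_cast()
library(raster)      # raster() for the georeferenced GeoTIFF
library(tmap)        # tm_shape(), tmap_leaflet()
library(plotly)      # ggplotly(), plot_ly(), subplot()
library(knitr)       # kable()
library(kableExtra)  # kable_paper(), scroll_box()
library(reshape2)    # melt()
```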

Card Usage data

The location names contain some special characters, such as “Café”, which are not valid under UTF-8 encoding in the raw files. Thus, an explicit encoding is specified when reading the data.

loyalty <- read_csv("data/loyalty_data.csv", locale=locale(encoding ="windows-1252"))
cc <- read_csv("data/cc_data.csv", locale=locale(encoding ="windows-1252"))

Take a glimpse of credit card data and loyalty card data

knitr::kable(cc[1:5,],
             caption = "Credit Card Usage Data") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 1: Credit Card Usage Data
timestamp location price last4ccnum
01/06/2014 07:28 Brew’ve Been Served 11.34 4795
01/06/2014 07:34 Hallowed Grounds 52.22 7108
01/06/2014 07:35 Brew’ve Been Served 8.33 6816
01/06/2014 07:36 Hallowed Grounds 16.72 9617
01/06/2014 07:37 Brew’ve Been Served 4.24 7384
knitr::kable(loyalty[1:5,],
             caption = "Loyalty Card Usage Data") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 2: Loyalty Card Usage Data
timestamp location price loyaltynum
01/06/2014 Brew’ve Been Served 4.17 L2247
01/06/2014 Brew’ve Been Served 9.60 L9406
01/06/2014 Hallowed Grounds 16.53 L8328
01/06/2014 Coffee Shack 11.51 L6417
01/06/2014 Hallowed Grounds 12.93 L1107

The timestamp in the credit card usage data (“cc”) contains both date and time, while the timestamp in the loyalty card usage data (“loyalty”) contains only the date. Besides, both are stored as strings, so they will be converted to date/datetime types.

We then extract the day and hour from the datetime features.

loyalty$timestamp <- as.Date(loyalty$timestamp, "%m/%d/%Y")
cc$timestamp <- strptime(cc$timestamp, "%m/%d/%Y %H:%M")

loyalty$day <- mday(loyalty$timestamp)
cc$date <- as.Date(cc$timestamp, "%m/%d/%Y %H:%M")
cc$day <- mday(cc$date)
cc$hour <- hour(cc$timestamp)
GPS and car assignments
gps <- read_csv("data/gps.csv")

knitr::kable(gps[1:5,],
             caption = "GPS Data") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 3: GPS Data
Timestamp id lat long
01/06/2014 06:28:01 35 36.07623 24.87469
01/06/2014 06:28:01 35 36.07622 24.87460
01/06/2014 06:28:03 35 36.07621 24.87444
01/06/2014 06:28:05 35 36.07622 24.87425
01/06/2014 06:28:06 35 36.07621 24.87417

The timestamp in the GPS data also needs to be transformed.

The longitude and latitude are also rounded to 5 decimal places. This mitigates inconsistent or inaccurate GPS readings to some extent. Five decimal places implies about 1.11 m accuracy, which suits this scenario better than 4 or 6 places (11.1 m or 0.11 m accuracy, respectively).

# transform features
gps$Timestamp <- strptime(gps$Timestamp, "%m/%d/%Y %H:%M:%S")
gps$day <- mday(gps$Timestamp)
### round the gps into 5 digits
gps$lat <- round(gps$lat, digits = 5)
gps$long <- round(gps$long, digits = 5)
# use individual gps2 to find stop locations
gps2 <- gps

The challenge page mentions that the vehicles are tracked periodically as long as they are moving. Thus, a time gap in the GPS data for a single car indicates that the car stopped at its current GPS location. Stops correspond to local business locations or other places of interest. To find the business locations, we exclude time gaps shorter than 3 minutes, which may simply be the car waiting at a traffic light.

gps2 <- gps2 %>% 
  group_by(id) %>% 
  mutate(end = Timestamp,
         start = lag(Timestamp, default = first(Timestamp),
                   order_by = Timestamp),
         diff_mins = difftime(end, start, units = "mins")) %>% 
  mutate(stop = ifelse(diff_mins >= 3, TRUE, FALSE)) %>% 
  filter(stop == TRUE) %>% 
  ungroup()
# rearrange useful features
gps2_stop <- gps2[c(7,6,2,3,4,8,5)]

gps2_stop_sf <- st_as_sf(gps2_stop,
                         coords = c("long", "lat"), # combine the lo, la
                         crs = 4326) # 4326 is wgs84 Geographic Coordinate System

The “start” in the “gps2_stop_sf” refers to the time when the car starts parking, while the “end” refers to the time when the car ends parking.

Besides, most vehicles are assigned one-to-one to employees. Only truck drivers are not assigned cars; they are instead allowed to use any available truck for business.

car_assignments <- read_csv("data/car-assignments.csv")
# check car assignment data
knitr::kable(car_assignments,
             caption = "Car assignment") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px")
Table 4: Car assignment
LastName FirstName CarID CurrentEmploymentType CurrentEmploymentTitle
Calixto Nils 1 Information Technology IT Helpdesk
Azada Lars 2 Engineering Engineer
Balas Felix 3 Engineering Engineer
Barranco Ingrid 4 Executive SVP/CFO
Baza Isak 5 Information Technology IT Technician
Bergen Linnea 6 Information Technology IT Group Manager
Orilla Elsa 7 Engineering Drill Technician
Alcazar Lucas 8 Information Technology IT Technician
Cazar Gustav 9 Engineering Drill Technician
Campo-Corrente Ada 10 Executive SVP/CIO
Calzas Axel 11 Engineering Hydraulic Technician
Cocinaro Hideki 12 Security Site Control
Ferro Inga 13 Security Site Control
Dedos Lidelse 14 Engineering Engineering Group Manager
Bodrogi Loreto 15 Security Site Control
Vann Isia 16 Security Perimeter Control
Flecha Sven 17 Information Technology IT Technician
Frente Birgitta 18 Engineering Geologist
Frente Vira 19 Engineering Hydraulic Technician
Fusil Stenig 20 Security Building Control
Osvaldo Hennie 21 Security Perimeter Control
Nubarron Adra 22 Security Badging Office
Lagos Varja 23 Security Badging Office
Mies Minke 24 Security Perimeter Control
Herrero Kanon 25 Engineering Geologist
Onda Marin 26 Engineering Drill Site Manager
Orilla Kare 27 Engineering Drill Technician
Borrasca Isande 28 Engineering Drill Technician
Ovan Bertrand 29 Facilities Facilities Group Manager
Resumir Felix 30 Security Security Group Manager
Sanjorge Jr.  Sten 31 Executive President/CEO
Strum Orhan 32 Executive SVP/COO
Tempestad Brand 33 Engineering Drill Technician
Vann Edvard 34 Security Perimeter Control
Vasco-Pais Willem 35 Executive Environmental Safety Advisor
Hafon Albina NA Facilities Truck Driver
Hawelon Benito NA Facilities Truck Driver
Hawelon Claudio NA Facilities Truck Driver
Mies Henk NA Facilities Truck Driver
Morlun Valeria NA Facilities Truck Driver
Morlun Adan NA Facilities Truck Driver
Morluniau Cecilia NA Facilities Truck Driver
Nant Irene NA Facilities Truck Driver
Scozzese Dylan NA Facilities Truck Driver
gps2_stop_sf <- left_join(gps2_stop_sf, 
                          car_assignments, by = c("id" = "CarID"))

Lastly, we also need to draw the car movement paths on the map. This requires the GPS data to be in coordinate format; one path is then a line string through multiple GPS points.

# convert values from numerical to factor data type
gps$day <- as.factor(gps$day)
gps$id <- as_factor(gps$id)

gps_sf <- st_as_sf(gps,
                   coords = c("long", "lat"),
                   crs = 4326)
# group car paths
gps_path <- gps_sf %>%
  group_by(id, day) %>%
  summarize(m =mean(Timestamp),
            do_union=FALSE) %>%
  st_cast("LINESTRING")

QGIS

The tourist map provided is not georeferenced. QGIS can georeference the image against the ESRI shapefiles (geospatial vector data) of the city.

The process includes:

  1. load JPG tourist map and shp road map
  2. create several referencing points between two maps
  3. start georeferencing maps and check the correspondence

After this process, we get a tif file that combines the tourist map with the georeferenced road map. We can then plot the car movement lines, using longitude and latitude data, on top of this map.

We need to import the tif file generated by QGIS and display the map.

bgmap <- raster("data/Geospatial/MC2-tourist.tif")

Visualization and Insights

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

To identify popularity, we can calculate the card usage frequency and amount at every location across different days and hours.

Firstly, let’s plot the frequency of card usage over the 14 days. We calculate the usage frequency per day for each card type, convert the results into data frames, draw their heatmaps, and plot them together.

Q1-Fig1 Code
# calculate the frequency data frame of credit and loyalty card usage
cc_freq_day <- as.data.frame(xtabs(~location+day, data = cc))
loyalty_freq_day <- as.data.frame(xtabs(~location+day, data = loyalty))

# join the two frequency data frame
freq_day_join <- full_join(cc_freq_day,loyalty_freq_day,by= c("location","day"))
names(freq_day_join) <- c("location","day","CC_Freq","Loyalty_Freq")
# transfer from factors to numeric with original values
freq_day_join$day <- as.numeric(levels(freq_day_join$day))[freq_day_join$day]
# plot the heatmap of credit card usage frequency 
p1 <- ggplot(freq_day_join,aes(x=day,y=location))+
  geom_tile(aes(fill=CC_Freq))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.title=element_blank())
# plot the heatmap of loyalty card usage frequency 
p2 <- ggplot(freq_day_join,aes(x=day,y=location))+
  geom_tile(aes(fill=Loyalty_Freq))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.title=element_blank())
# convert static graph into interactive
plotly::subplot(ggplotly(p1),
                ggplotly(p2),
                shareY = TRUE)

Figure 1: Daily Frequency of Credit (left) and Loyalty (right) Card Usage

From the card usage frequency (i.e., consumption frequency), we can easily identify that “Katerina’s Café”, “Hippokampos” and “Brew’ve Been Served” are the most popular, with almost all squares in a deeper color, meaning the daily consumption frequency is above 10. “Hallowed Grounds” and “Guy’s Gyros” are slightly less popular.

Besides, we can find that “Brew’ve Been Served” is popular every day except weekends (days 11-12 and 18-19), when its frequency drops to 0, which might be because the location is closed on weekends. The same holds for “Hallowed Grounds”.

On weekends, “Katerina’s Café” and “Hippokampos” are the most popular, while other locations may be closed or see less consumption on those days.

As for anomalies, there is one white line in the loyalty card graph, corresponding to “Daily Dealz”. This location has only one credit card record, on day 13, and no loyalty card records across the two weeks.

Also, the daily frequencies of “Maximum Iron and Steel” and “Kronos Pipe and Irrigation” are identical on every day of the two weeks.

To correct these anomalies, we can check the GPS data to determine who made the single purchase at “Daily Dealz”. If no anomaly is confirmed after checking, we can simply delete this single record from the credit card data. As for “Maximum Iron and Steel” and “Kronos Pipe and Irrigation”, after checking the consumption amounts, the matching frequencies turn out to be a coincidence.
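The coincidence check between the two suppliers can be sketched as follows (assuming the `cc` data frame prepared above; `pivot_wider` is used here as an alternative to the `xtabs` approach):

```r
# Daily credit card usage counts at the two suspicious locations
pair_freq <- cc %>%
  filter(location %in% c("Maximum Iron and Steel",
                         "Kronos Pipe and Irrigation")) %>%
  count(location, day) %>%
  tidyr::pivot_wider(names_from = location, values_from = n,
                     values_fill = 0)

# Days on which the two daily frequencies coincide
pair_freq %>%
  filter(`Maximum Iron and Steel` == `Kronos Pipe and Irrigation`)
```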

Secondly, we can plot the consumption amount instead of frequency. The steps are almost the same.

Q1-Fig2 Code
cc_price_matrix <- tapply(cc$price,cc[,c("location","day")],sum)
cc_price <- reshape2::melt(cc_price_matrix)
cc_price <- na.omit(cc_price)
p1_price <- ggplot(cc_price,aes(x=day,y=location))+
  geom_tile(aes(fill=value))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.title=element_blank())

ggplotly(p1_price)

Figure 2: Daily Consumption Amount of Credit Card

Q1-Fig3 Code
plot_ly(cc_price, x = ~value, y = ~location, type = "box",
        boxpoints = "outliers", marker = list(color= 'rgb(255,0,0)')) %>% 
  layout(showlegend = FALSE)

Figure 3: Daily Consumption Amount of Credit Card (box plot)

The differences in consumption amount among locations are much bigger than the differences in frequency.

Apparently, “Abila Airport” is the place with the biggest consumption amounts, and these transactions occurred on weekdays only.

Besides, “Stewart and Sons Fabrication” and “Nationwide Refinery” also show high consumption amounts. None of these locations have high frequency values in the previous graphs, yet all have very high daily consumption amounts.

There are also many outliers which might be anomalies. For example, “Frydos Autosupply n’ More” had a daily credit card consumption amount of $10,455.22 on day 13, several times higher than on other days. “Albert’s Fine Clothing” also has a daily consumption outlier, on day 17.

What’s more, there are many inconsistencies between the amounts in the credit card records and the loyalty card records. At “Stewart and Sons Fabrication”, the daily amounts from day 13 to day 16 don’t match between the two graphs.

To correct these anomalies, we need to check the car movement data wherever the consumption amount outliers occur, to see whether events or other gatherings caused the high consumption. As for the inconsistencies in amounts, possible explanations are that someone used only one of the two cards, or received cashback on the purchase.
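One way to surface the amount inconsistencies systematically is to join the daily totals from both sources and keep the location-days that disagree (a sketch over the `cc` and `loyalty` data frames prepared earlier):

```r
# Daily totals per location from each card type
cc_daily <- cc %>%
  group_by(location, day) %>%
  summarise(cc_total = sum(price), .groups = "drop")

loyalty_daily <- loyalty %>%
  group_by(location, day) %>%
  summarise(loyalty_total = sum(price), .groups = "drop")

# Location-days whose totals are missing on one side or differ
full_join(cc_daily, loyalty_daily, by = c("location", "day")) %>%
  filter(is.na(cc_total) | is.na(loyalty_total) |
           abs(cc_total - loyalty_total) > 0.01)
```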

Lastly, we change the time unit from days to hours to analyze the popular locations. Only the timestamps in the credit card data contain a time of day, so there are no hourly heatmaps for loyalty card usage.

Q1-Fig4 Code
cc_freq_hour <- as.data.frame(xtabs(~location+hour, data = cc))
# convert factor into number
cc_freq_hour$hour <- as.numeric(levels(cc_freq_hour$hour))[cc_freq_hour$hour]

cc_price_hour_matrix <- tapply(cc$price,cc[,c("location","hour")],sum)
cc_price_hour <- reshape2::melt(cc_price_hour_matrix)

cc_hour_join <- full_join(cc_freq_hour, cc_price_hour, by= c("location","hour"))
names(cc_hour_join) <- c("location","hour","Freq","Amount")

p3_freq <- ggplot(cc_hour_join,aes(x=hour,y=location))+
  geom_tile(aes(fill=Freq))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

p3_price <- ggplot(cc_hour_join,aes(x=hour,y=location))+
  geom_tile(aes(fill=Amount))+
  scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
  theme(panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

plotly::subplot(ggplotly(p3_freq),
        ggplotly(p3_price),
        shareY = TRUE) %>% 
  hide_colorbar()

Figure 4: Hourly Consumption Frequency and Amount of Credit Card

From the left hourly heatmap, we can easily identify the popular period for each location, since there are clear patterns.

Some anomalies occur at strange times. At 3am, there are 5 credit card transactions at “Kronos Mart”. For “Daily Dealz”, the only credit card transaction happened at 6am.

The right heatmap also shows an anomaly: the transactions at “Bean There Done That” and “Coffee Shack” all happened at exactly 12 o’clock.

knitr::kable(cc %>% 
               filter(location == "Bean There Done That" | location == "Coffee Shack" ),
             caption = "Consumption record at Bean There Done That and Coffee Shack ") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px") 
Table 5: Consumption record at Bean There Done That and Coffee Shack
timestamp location price last4ccnum date day hour
2014-01-06 12:00:00 Coffee Shack 51.51 7117 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 10.28 1415 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 16.09 9635 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 8.54 1877 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 5.24 1321 2014-01-06 6 12
2014-01-06 12:00:00 Bean There Done That 4.27 6895 2014-01-06 6 12
2014-01-07 12:00:00 Coffee Shack 16.63 7117 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 51.25 1415 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 53.89 1877 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 8.03 1321 2014-01-07 7 12
2014-01-07 12:00:00 Bean There Done That 46.25 6895 2014-01-07 7 12
2014-01-08 12:00:00 Coffee Shack 13.13 7117 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 17.45 1415 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 12.07 9635 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 8.06 1877 2014-01-08 8 12
2014-01-08 12:00:00 Bean There Done That 94.96 1321 2014-01-08 8 12
2014-01-09 12:00:00 Coffee Shack 5.01 7117 2014-01-09 9 12
2014-01-09 12:00:00 Bean There Done That 13.00 1415 2014-01-09 9 12
2014-01-09 12:00:00 Bean There Done That 4.06 1877 2014-01-09 9 12
2014-01-09 12:00:00 Bean There Done That 15.53 1321 2014-01-09 9 12
2014-01-10 12:00:00 Bean There Done That 15.39 1415 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 3.92 9635 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 13.41 1877 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 18.67 1321 2014-01-10 10 12
2014-01-10 12:00:00 Bean There Done That 9.30 6895 2014-01-10 10 12
2014-01-13 12:00:00 Coffee Shack 4.09 7117 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 7.93 1415 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 98.02 9635 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 19.56 1877 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 5.54 1321 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 10.37 1874 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 15.24 9617 2014-01-13 13 12
2014-01-13 12:00:00 Bean There Done That 18.59 6895 2014-01-13 13 12
2014-01-14 12:00:00 Coffee Shack 3.63 7117 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 7.28 1415 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 19.37 9635 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 14.56 1877 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 67.82 1321 2014-01-14 14 12
2014-01-14 12:00:00 Bean There Done That 5.90 6895 2014-01-14 14 12
2014-01-15 12:00:00 Bean There Done That 5.34 9635 2014-01-15 15 12
2014-01-15 12:00:00 Bean There Done That 93.13 1877 2014-01-15 15 12
2014-01-15 12:00:00 Bean There Done That 18.96 1321 2014-01-15 15 12
2014-01-15 12:00:00 Bean There Done That 15.24 6895 2014-01-15 15 12
2014-01-16 12:00:00 Coffee Shack 10.48 7117 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 19.82 1415 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 12.33 9635 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 5.56 1877 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 98.34 1321 2014-01-16 16 12
2014-01-16 12:00:00 Bean There Done That 17.40 6895 2014-01-16 16 12
2014-01-17 12:00:00 Coffee Shack 19.16 7117 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 19.67 1415 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 8.21 9635 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 19.56 1877 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 18.72 1321 2014-01-17 17 12
2014-01-17 12:00:00 Bean There Done That 5.77 6895 2014-01-17 17 12

We can see that all of these records carry the timestamp 12:00:00. There might be something wrong with the machine that records credit card transactions at these two locations.
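Such locations can also be flagged programmatically rather than by eye: group by location and count the distinct clock times (a sketch over the `cc` data frame):

```r
# Locations where every credit card record shares a single clock time
cc %>%
  mutate(clock = format(as.POSIXct(timestamp), "%H:%M:%S")) %>%
  group_by(location) %>%
  summarise(n_records = n(),
            n_times = n_distinct(clock),
            .groups = "drop") %>%
  filter(n_records > 1, n_times == 1)
```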

Q2: GPS Data and Anomalies

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Anomaly 1: High consumption at “Frydos Autosupply n’ More” on day 13, Figure 2

Firstly, filter out the credit card consumption record at “Frydos Autosupply n’ More” on day 13.

knitr::kable(cc %>% 
               filter(day == 13 & location == "Frydos Autosupply n' More"),
             caption = "Consumption record at Frydos Autosupply n' More on day 13") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 6: Consumption record at Frydos Autosupply n’ More on day 13
timestamp location price last4ccnum date day hour
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 2014-01-13 13 19
2014-01-13 19:41:00 Frydos Autosupply n’ More 188.57 8129 2014-01-13 13 19
2014-01-13 19:59:00 Frydos Autosupply n’ More 64.60 8411 2014-01-13 13 19
2014-01-13 21:11:00 Frydos Autosupply n’ More 202.05 2418 2014-01-13 13 21

The abnormal consumption is from the cc number 9551. Let’s check the consumption records of this cc on day 13.

knitr::kable(cc %>% 
               filter(day == 13 & last4ccnum == 9551),
             caption = "Consumption record of cc 9551 owner on day 13") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 7: Consumption record of cc 9551 owner on day 13
timestamp location price last4ccnum date day hour
2014-01-13 06:04:00 Daily Dealz 2.01 9551 2014-01-13 13 6
2014-01-13 13:18:00 U-Pump 55.25 9551 2014-01-13 13 13
2014-01-13 13:28:00 Hippokampos 30.51 9551 2014-01-13 13 13
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 2014-01-13 13 19
2014-01-13 19:30:00 Ouzeri Elian 28.75 9551 2014-01-13 13 19

We can see the card owner made a transaction at “Daily Dealz” in the early morning (6 o’clock); it is the only transaction at that location in the entire two weeks.

Besides, “U-Pump” is also a special place: there were only two transactions there in the two weeks, as can be seen in Figure 4 and Figure 1. Therefore, there should be few car stops near U-Pump in the GPS data.

We can check the stop locations on day 13. On that day, there was one stop near U-Pump whose parking window covers the transaction time at “U-Pump”. The corresponding car id is 24.
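The spatial matching behind this claim can be sketched with an explicit distance filter. The U-Pump coordinates below are placeholders to be read off the georeferenced tourist map, not values taken from the dataset:

```r
# Hypothetical coordinates for U-Pump, read manually off the tourist map
u_pump <- st_sfc(st_point(c(24.8457, 36.0580)), crs = 4326)

# Stops on day 13 within ~100 m of that point whose parking window
# covers the 13:18 purchase time at U-Pump
gps2_stop_sf %>%
  filter(day == 13) %>%
  mutate(dist_m = as.numeric(st_distance(geometry, u_pump))) %>%
  filter(dist_m < 100,
         start <= as.POSIXct("2014-01-13 13:18:00"),
         end   >= as.POSIXct("2014-01-13 13:18:00"))
```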

Thus, we think that Minke, the owner of car 24, uses credit card 9551. Let’s draw the movement path of this car to discover more. All elements are drawn in the same plot to aid understanding.

Q2-Fig1 Code
gps2_stop_day13 <- gps2_stop_sf %>% 
  filter(day ==13)

gps2_stop_car24_day13 <- gps2_stop_sf %>% 
  filter(day ==13 & id == 24)

gps_path_car24_day13 <- gps_path %>% 
  filter(day == 13 & id == 24)

map1 <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_path_car24_day13) +
  tm_lines(col = "blue") +
  tm_shape(gps2_stop_day13) +
  tm_dots() +
  tm_shape(gps2_stop_car24_day13) +
  tm_dots(col = "blue", size = 0.1)
tmap_leaflet(map1)

Figure 5: Stop locations and driving path of car 24 on day 13

Hovering over the blue dots shows the stop locations of car 24. On day 13, the car set off from home (the east area of the map) at about 7:30 and stopped at “Katerina’s Café” (the south-east area) for half an hour. Then the car stopped near “Albert’s Fine Clothing” around noon (the north-west area).

After that, the car stopped near “U-Pump” (the center area) from 12:35 to 13:22. The purchase time in “U-Pump”, 13:18:00, matches the time period.

From 13:27 to 17:57, the car was parked at the GASTech company (south area), presumably while the employee was working.

After work, the car stopped near “Brew’ve Been Served” (the south-east area) from 18:00 to 19:29. The high-value transaction occurred in this period. The stop location is also close to “Frydos Autosupply n’ More”, so the driver might have parked the car and walked to “Frydos Autosupply n’ More” to make the purchase.

Several things are strange:

  1. The transaction at “Daily Dealz” occurred at 06:04:00, while the car left home at 07:32:01. It’s strange that the purchase happened so early, and the location can’t be found in any other records.

  2. The transaction at “Hippokampos” occurred at 13:28:00, while the car stopped at the company at 13:27:14. The time gap is under a minute.

  3. The transaction at “Ouzeri Elian” occurred at 19:30:00, while the car left “Frydos Autosupply n’ More” at 19:29:01, a gap of about one minute.

We can cross-check the purchases by combining the credit and loyalty card data, using a left join to find the corresponding records in the loyalty data.

knitr::kable(cc %>% 
               filter(day == 13 & last4ccnum == 9551) %>% 
               left_join(loyalty, by = c("location", "day", "price")),
             caption = "Consumption record of cc 9551 with corresponding loyalty records on day 13") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 8: Consumption record of cc 9551 with corresponding loyalty records on day 13
timestamp.x location price last4ccnum date day hour timestamp.y loyaltynum
2014-01-13 06:04:00 Daily Dealz 2.01 9551 2014-01-13 13 6 NA NA
2014-01-13 13:18:00 U-Pump 55.25 9551 2014-01-13 13 13 NA NA
2014-01-13 13:28:00 Hippokampos 30.51 9551 2014-01-13 13 13 2014-01-13 L5777
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 2014-01-13 13 19 NA NA
2014-01-13 19:30:00 Ouzeri Elian 28.75 9551 2014-01-13 13 19 2014-01-13 L5777

We can see that the two transactions with very small time gaps to the car stopping/leaving have exactly corresponding loyalty card usage, yet the other three transactions, which should have been less rushed, did not use a loyalty card. One possible explanation is card theft. This suspicious activity is analyzed further in question 5.

Anomaly 2: Mid-night consumption at “Kronos Mart”, Figure 4

The first step is to find the corresponding records.

knitr::kable(cc %>% 
               filter(location == "Kronos Mart"),
             caption = "Consumption at Kronos Mart") %>% 
  kableExtra::kable_paper("hover", full_width = F)
Table 9: Consumption at Kronos Mart
timestamp location price last4ccnum date day hour
2014-01-10 09:30:00 Kronos Mart 203.91 7688 2014-01-10 10 9
2014-01-12 03:39:00 Kronos Mart 277.26 8156 2014-01-12 12 3
2014-01-13 03:00:00 Kronos Mart 147.30 5407 2014-01-13 13 3
2014-01-13 08:01:00 Kronos Mart 159.06 6816 2014-01-13 13 8
2014-01-14 08:20:00 Kronos Mart 58.85 6899 2014-01-14 14 8
2014-01-16 07:30:00 Kronos Mart 298.83 7108 2014-01-16 16 7
2014-01-17 08:08:00 Kronos Mart 286.24 1415 2014-01-17 17 8
2014-01-19 03:13:00 Kronos Mart 87.66 3484 2014-01-19 19 3
2014-01-19 03:45:00 Kronos Mart 194.51 9551 2014-01-19 19 3
2014-01-19 03:48:00 Kronos Mart 150.36 8332 2014-01-19 19 3

The strange records are the last 3 rows: transactions at 3 o’clock on day 19 by the owners of credit cards 3484, 9551 and 8332.

Coincidentally, credit card 9551 also appeared in Anomaly 1.

Day 19 is one day before the employees went missing. We can check the car stop points over the most recent week to look for a reason or any anomalies.

Q2-Fig2 Code
gps2_stop_days <- gps2_stop_sf %>%
  filter(between(day,13,18))

gps2_stop_day19 <- gps2_stop_sf %>%
  filter(day == 19)

map2 <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps2_stop_days) +
  tm_dots(size = 0.1, alpha = 0.5) +
  tm_shape(gps2_stop_day19) +
  tm_dots(col = "red", size = 0.1, alpha = 0.5)
tmap_leaflet(map2)

Figure 6: Stop locations from day 13 to day 19

“Kronos Mart” is located in the west, marked with a red symbol. After zooming in, we can see that there were no car stop locations near the mart on day 19 (red dots). The few red dots closest to it, located at “Roberts and Sons”, were afternoon stops (stop periods within 13 to 14 o’clock).

But there are three black dots very close to “Kronos Mart”. Furthermore, all three car stops started at about 13:30 and ended at about 16:00 on day 18.

The owners of the three cars are listed.

gps2_stop_days %>% 
  filter(id %in% c(1, 10, 23)
         & day == 18
         & start > "2014-01-18 13:00:00"
         & end < "2014-01-18 16:00:00") %>% 
knitr::kable(caption = "The three car stop near Kronos Mart") %>% 
  kableExtra::kable_paper("hover", full_width=T)
Table 10: The three car stop near Kronos Mart
start end id diff_mins day geometry LastName FirstName CurrentEmploymentType CurrentEmploymentTitle
2014-01-18 13:48:01 2014-01-18 15:14:01 23 86.0 mins 18 POINT (24.8498 36.06586) Lagos Varja Security Badging Office
2014-01-18 13:29:31 2014-01-18 15:52:01 10 142.5 mins 18 POINT (24.84983 36.06588) Campo-Corrente Ada Executive SVP/CIO
2014-01-18 13:36:43 2014-01-18 15:58:01 1 141.3 mins 18 POINT (24.84982 36.06582) Calixto Nils Information Technology IT Helpdesk

We can’t draw insights from the car owner information alone, since the three owners belong to different employment types. But they stayed at the same location for a similar time period, so they very likely met each other for the same purpose. Besides, the 3 o’clock consumption came from 3 credit cards, and this meetup also involved 3 people.

Thus, one possible explanation for the midnight consumption is that the three car owners came to discuss some plan on day 18 and met again at 3 o’clock on day 19.

Another possible explanation is that these people stayed near the mart, so they did not need to drive and simply walked there to make the purchases. Or the three people used other, non-company vehicles to reach the mart.

These suspicious activities will be analyzed further in Question 5, where we may need to examine the behavior of the three car owners across the 14 days.

Discrepancies between data
cc_num <- length(unique(cc$last4ccnum))
loyalty_num <- length(unique(loyalty$loyaltynum))
ppl_num <- length(car_assignments$LastName)
c(cc_num, loyalty_num, ppl_num)
[1] 55 54 44

We find that there are 44 employees, but 55 credit cards and 54 loyalty cards. If we assume no errors in the card IDs, some employees must hold more than one credit card and/or more than one loyalty card.

Q3: Infer card owners

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

If a consumption time falls within a car stop period, we treat it as one possible correspondence. There may be several different purchases within one car stop period, but we can infer that the most frequent pair within a group is most likely the true pair.

The credit card data contains exact timestamps, but the loyalty card data only records the date. Thus, we will first find the relationship between credit cards and loyalty cards, and then match credit cards to car stops. The owners of the loyalty cards can then be inferred by chaining the two relationships.
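As a minimal, self-contained sketch of this interval rule (toy tibbles built from the two Frank’s Fuel records that appear later in this post), a purchase becomes a candidate pair when its timestamp falls inside a car stop’s start/end window:

```r
# Minimal sketch of the interval-matching rule, using the two
# Frank's Fuel records from later in the analysis as toy data.
library(dplyr)
library(lubridate)

stops <- tibble(
  id       = c(15, 3),
  start    = ymd_hms(c("2014-01-08 11:45:01", "2014-01-18 18:07:01")),
  end      = ymd_hms(c("2014-01-08 12:33:01", "2014-01-18 18:40:01")),
  location = "Frank's Fuel")

purchases <- tibble(
  last4ccnum = c(3853, 9635),
  timestamp  = ymd_hms(c("2014-01-08 12:29:00", "2014-01-18 18:39:00")),
  location   = "Frank's Fuel")

# a purchase matches a stop when its timestamp lies inside [start, end]
candidate_pairs <- purchases %>%
  left_join(stops, by = "location") %>%
  filter(timestamp > start & timestamp < end) %>%
  select(id, last4ccnum)
candidate_pairs
```

This is the same filter applied to the real `gps2_stop_short` and `cc` tables later on.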

Credit and loyalty cards pairs

We full join the two card tables by matching day, location and price; it is rare for two different consumptions to share the same values on all three features. Then we use group_by() to list all pairs of the two cards and count the consumption frequency of each pair.

Some rows will fail to match, perhaps because someone used only one of the two cards or received cashback. We filter out these cases, where the card pair contains a null value.

# make a full join
card_correspond_count <- full_join(cc, loyalty,
                             by = c("day", "location", "price")) %>%
  # calculate frequency
  group_by(last4ccnum, loyaltynum) %>%
  summarise(count = n()) %>%
  # filter out mismatch
  drop_na()

# convert 'last4ccnum' into string to plot 
card_correspond_count$last4ccnum <- as.character(card_correspond_count$last4ccnum)

Most pairs are one-to-one. We can confidently conclude that these pairs are true (the credit and loyalty card in each pair belong to one owner).

card_correspond_count_one2one <- card_correspond_count %>% 
  filter((n_distinct(last4ccnum)==1 & n_distinct(loyaltynum)==1))

knitr::kable(card_correspond_count_one2one,
             caption = "One-on-one matched pairs") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px")
Table 11: One-on-one matched pairs
last4ccnum loyaltynum count
1310 L8012 21
1321 L4149 22
1415 L7783 24
1874 L4424 25
1877 L3014 18
2142 L9637 25
2276 L3317 10
2418 L9018 20
2463 L6886 23
2540 L5947 20
2681 L1107 20
3484 L2490 24
3492 L7814 22
3506 L7761 6
3547 L9362 14
3853 L1485 22
4434 L2169 26
4530 L8477 10
5010 L2459 5
5407 L4034 20
6691 L6267 20
6816 L8148 20
6895 L3366 21
6899 L6267 23
6901 L9363 28
7108 L6544 16
7117 L6417 28
7253 L1682 24
7354 L9254 21
7384 L3800 26
7688 L4164 22
7792 L5756 20
7819 L5259 27
8129 L8328 21
8156 L5224 22
8202 L2343 25
8411 L6110 27
8642 L2769 12
9152 L5485 2
9220 L4063 9
9241 L3288 13
9405 L3259 21
9551 L5777 12
9614 L5924 2
9617 L5553 26
9635 L3191 24
9683 L7291 18
9735 L9633 16

For the cards that matched but not one-to-one, we can plot a parallel sets diagram.

Q3-Fig1 Code
# filter multiple match
card_correspond_count_others <- card_correspond_count %>%
  filter(n_distinct(last4ccnum)>1 | n_distinct(loyaltynum)>1)

card_correspond_count_others_plot <- card_correspond_count_others  %>%
  gather_set_data(1:2) %>%        # <- ggforce helper function
  arrange(x,last4ccnum,desc(loyaltynum))
# plot
ggplot(card_correspond_count_others_plot, 
       aes(x = x, id = id, split = y, value = count)) +
  geom_parallel_sets(aes(fill = last4ccnum), alpha = 0.7, 
                     axis.width = 0.2, n=100, strength = 0.5) +
  geom_parallel_sets_axes(axis.width = 0.25, fill = "gray95",
                          color = "gray80", size = 0.15) +
  geom_parallel_sets_labels(colour = 'gray35', size = 4.5, 
                            angle = 0, fontface="bold") +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 12, face = "bold"),
    axis.title.x  = element_blank()
    )

Figure 7: Multiple matched pairs

We can infer that the pair with a wider band is the true pair. For example, credit card 4795 (blue band) matches both loyalty cards L2070 and L8566, but we can be confident that L8566 is the true pair, because the consumption count with L8566, shown as the band width, is much higher than with L2070.
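This “widest band wins” rule can also be applied in code. A minimal sketch, using the joint counts for cards 4795 and 8332 from Table 12; slice_max() keeps, for each credit card, the loyalty card with the highest count:

```r
library(dplyr)

# joint counts for two of the multi-matched credit cards (from Table 12)
multi <- tibble(
  last4ccnum = c("4795", "4795", "8332", "8332"),
  loyaltynum = c("L2070", "L8566", "L2070", "L8566"),
  count      = c(1, 25, 27, 1))

# keep, for each credit card, the loyalty card with the highest joint count
best_pairs <- multi %>%
  group_by(last4ccnum) %>%
  slice_max(count, n = 1) %>%
  ungroup()
best_pairs
```

This reproduces the visual reading of Figure 7: 4795 pairs with L8566 and 8332 with L2070.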

knitr::kable(card_correspond_count_others,
             caption = "Other matched pairs") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px")
Table 12: Other matched pairs
last4ccnum loyaltynum count
1286 L3288 15
1286 L3572 13
4795 L2070 1
4795 L8566 25
4948 L3295 1
4948 L9406 22
5368 L2247 24
5368 L6119 1
5921 L3295 12
5921 L9406 1
7889 L2247 1
7889 L6119 20
8332 L2070 27
8332 L8566 1

The matched pairs with a count of 1 might reflect suspicious activities. We assign the rows with a count over 5 as true pairs.

In the final predicted card pairs, only credit card 1286 corresponds to multiple loyalty cards (L3288, L3572), as can be seen in Figure 7.

card_correspond_count_others_true <- card_correspond_count_others %>% 
  filter(count > 5)
# union the two true-pair tables
card_correspond_true <- bind_rows(card_correspond_count_one2one, 
                                  card_correspond_count_others_true)
Match GPS and credit card data

To match credit card consumption and GPS data, we can assume that one car stop corresponds to one consumption if the consumption time falls within the car stop period at the same location.

But before that, we have to label the GPS car stop points with specific locations.

All car stop locations are plotted on the map, and we can see many locations where a car stopped for over 6 hours (blue dots on the map). Most of them are near the five parks (along the coast).

We are interested in car stops where credit card consumption happened, so we should exclude these stops which are very likely at home.

Besides, we notice some long car stops near other locations. The blue dots near “Ouzeri Elian” all belong to Isande, car 28, who drives very regularly: stopping at about 8:00 and leaving at about 17:00.

It’s the same for car 9, Gustav. The car has many long stops near “Bean There Done That” (north-west area): it stops at about 17:00 and starts moving again the next day at about 8:00. It seems that he lives there.

Q3-Fig2 Code
gps2_stop_long <- gps2_stop_sf %>%
  filter(diff_mins >= 60*6)

gps2_stop_short <- gps2_stop_sf %>%
  filter(diff_mins < 60*6)

map4 <- tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps2_stop_short) +
  tm_dots(size = 0.1, alpha = 0.5) +
  tm_shape(gps2_stop_long) +
  tm_dots(col = "blue", size = 0.2, alpha = 0.3)
tmap_leaflet(map4)

Figure 8: Long and short car stop

After excluding the long car stops, we find that the remaining stops in Figure 8 are still messy. It is difficult to group car stops distinctly and label them with locations.

Many car stop locations might not correspond to any local business location. Some car stops that do correspond to consumption at a location do not have a close GPS position. Furthermore, some locations are very near each other.

To handle this, we first find the most confident correspondences. For example, the multiple dots near “Abila Airport” (southwest) form a clear, distinct group: they are not single stops, they lie close together within one group, and they are far from other dots and locations. Thus, we can believe with high confidence that they correspond to the airport.

It’s the same for “Maximum Iron and Steel”(west), “Abila Scrapyard”(northwest), “Frank’s Fuel”(west), “Bean There Done That”(northwest), “Coffee Cameleon”(southeast), “Chostus Hotel”(northeast).

Some dots might not form a distinct group, and several dot groups may sit near one location. But some groups can still be labeled with one location confidently, such as the dots along the street by “Kronos Mart” (west). “Roberts and Sons” (west), “Desafio Golf Course” (northeast), “Albert’s Fine Clothing” (north), “Jack’s Magical Beans” (northeast), and “Hallowed Grounds” (east) also have such dot groups.

After labeling these dot groups, we can match the credit card records by timestamp and location label. This gives us possible pairs of car ID and credit card. For each pair, we check whether every credit card record has a unique corresponding car stop record over the 14 days. If all records match, the pair is regarded as a confident pair.

# add a distinct index to select GPS records later
gps2_stop_short$idx <- seq_len(nrow(gps2_stop_short))
# calculate the distance between any two dot
distance_matrix <- st_distance(gps2_stop_short$geometry, gps2_stop_short$geometry)
# add a new col for labeling locations
gps2_stop_short$location <- ""

Take the dot group near “Frank’s Fuel” (west) as an example:

We can hover over the map above to find one of the dots close to the location and get the distinct ‘idx’ of that car stop.

Then we need to filter out all dots belonging to this group. To do this, we find all dots whose distance to it is less than 50 meters, and plot them on the map to check whether any dots are missing or whether the 50-meter range is so large that it includes dots from other locations.
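As a self-contained illustration of this 50-meter rule (three hypothetical points at Abila’s latitude, two roughly 10 m apart and one a few hundred meters away), st_distance() returns great-circle distances in meters for WGS 84 coordinates:

```r
library(sf)

# three hypothetical stop points: 1 and 2 are ~10 m apart, 3 is far away
pts <- st_as_sf(
  data.frame(idx = 1:3,
             lon = c(24.8500, 24.8501, 24.8550),
             lat = c(36.0660, 36.0660, 36.0660)),
  coords = c("lon", "lat"), crs = 4326)

d <- st_distance(pts, pts)           # pairwise distances in meters
group_idx <- which(as.integer(d[1, ]) < 50)
group_idx                            # only points 1 and 2 form a group
```

The same `which(... < 50)` pattern is applied to the full `distance_matrix` in the code below.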

## "Frank's Fuel"(west), 
# start with dot 2545, which is near this location
# find all dots which have less than 50 meter in distance
frank_idx <- which(as.integer(distance_matrix[2545,]) < 50)
dot_group <- gps2_stop_short %>%
  filter(idx %in% frank_idx)
others <- gps2_stop_short %>%
  filter(!idx %in% frank_idx)
# check whether there are other dots near this group but with different labels (blue dots) on the map
# if yes, we need to change the dot distance from default value(50) to a bigger value
mapx <- tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(dot_group) +
  tm_dots(size = 0.1, alpha = 0.4) +
  tm_shape(others) +
  tm_dots(col = "blue", size = 0.2, alpha = 0.5)
tmap_leaflet(mapx)

After filtering this dot group, we label its members with the location “Frank’s Fuel” and match them with credit card consumption records by location and timestamp (the consumption time must fall between the start and end time of the car stop). This gives us possible pairs of one car and one credit card.

# label them with this location
gps2_stop_short$location[gps2_stop_short$idx %in% frank_idx] <- "Frank's Fuel"
# match records in cc
gps2_stop_short %>% 
  filter(idx %in% frank_idx) %>% 
  left_join(cc, by = c('location')) %>% 
  filter(timestamp > start & timestamp < end) %>% 
  select(id, start, end, timestamp, last4ccnum)
Simple feature collection with 2 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.84133 ymin: 36.07212 xmax: 24.84135 ymax: 36.07213
Geodetic CRS:  WGS 84
# A tibble: 2 x 6
     id start               end                 timestamp          
  <dbl> <dttm>              <dttm>              <dttm>             
1    15 2014-01-08 11:45:01 2014-01-08 12:33:01 2014-01-08 12:29:00
2     3 2014-01-18 18:07:01 2014-01-18 18:40:01 2014-01-18 18:39:00
# ... with 2 more variables: last4ccnum <dbl>, geometry <POINT [°]>

We find 2 possible pairs. Let’s check them separately.

For car id 15 and cc 3853:

# match all records of car id 15, cc 3853
match_cc <- subset(cc,last4ccnum == 3853) %>% 
  left_join(subset(gps2_stop_short, id == 15, select = c(start,end,idx, day)),
            by = c('day')) %>% 
  filter(timestamp > start & timestamp < end)

Check whether there are any cc records that match multiple car stops.

match_cc %>% 
  group_by(idx) %>% 
  summarize(count = n()) %>% 
  filter(count >1)
# A tibble: 0 x 2
# ... with 2 variables: idx <int>, count <int>

Check whether there are any cc records which are not matched.

subset(cc,last4ccnum == 3853) %>%
  setdiff(match_cc[1:7])
# A tibble: 0 x 7
# ... with 7 variables: timestamp <dttm>, location <chr>,
#   price <dbl>, last4ccnum <dbl>, date <date>, day <int>, hour <int>
# All match, so label them in GPS with respective locations
gps2_stop_short$location[gps2_stop_short$idx %in% match_cc$idx] <- match_cc$location

All records match, so we conclude that credit card 3853 belongs to the owner of car 15.

For car id 3, cc 9635, the steps are the same.

# match all records of car id 3, cc 9635
match_cc <- subset(cc,last4ccnum == 9635) %>% 
  left_join(subset(gps2_stop_short, id == 3, select = c(start,end,idx, day)),
            by = c('day')) %>% 
  filter(timestamp > start & timestamp < end)
# check whether there are cc records which match multiple car stops
match_cc %>% 
  group_by(idx) %>% 
  summarize(count = n()) %>% 
  filter(count >1)
# A tibble: 0 x 2
# ... with 2 variables: idx <int>, count <int>
# check whether there are any cc records which are not matched
subset(cc,last4ccnum == 9635) %>% 
  setdiff(match_cc[1:7])
# A tibble: 2 x 7
  timestamp           location price last4ccnum date         day  hour
  <dttm>              <chr>    <dbl>      <dbl> <date>     <int> <int>
1 2014-01-14 12:00:00 Bean Th~ 19.4        9635 2014-01-14    14    12
2 2014-01-15 12:00:00 Bean Th~  5.34       9635 2014-01-15    15    12
# 2 out of 26 records do not match, but their location is "Bean There Done That", so it is still a confident pair
# label them in GPS with respective locations
gps2_stop_short$location[gps2_stop_short$idx %in% match_cc$idx] <- match_cc$location

The credit card 9635 belongs to the owner of the car 3.

Other dot groups go through the same process to find pairs of cars and credit cards.
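One way to reduce the manual repetition is to wrap the per-group steps into a helper function. The sketch below is not the code used in this post: it runs on hypothetical toy inputs standing in for gps2_stop_short, cc and distance_matrix; with the real objects the call would look like label_and_match(gps2_stop_short, cc, distance_matrix, 2545, "Frank's Fuel").

```r
library(dplyr)
library(lubridate)

# Sketch of the per-group workflow: group dots around a hand-picked seed,
# label them, and return candidate car/credit-card pairs.
label_and_match <- function(stops, purchases, dist_m, seed_idx,
                            location_name, radius_m = 50) {
  # all stops within radius_m of the seed dot form one group
  group_idx <- which(as.integer(dist_m[seed_idx, ]) < radius_m)
  stops$location[stops$idx %in% group_idx] <- location_name
  # candidate pairs: purchase timestamp inside the stop period
  stops %>%
    filter(idx %in% group_idx) %>%
    left_join(purchases, by = "location") %>%
    filter(timestamp > start & timestamp < end) %>%
    select(id, start, end, timestamp, last4ccnum)
}

# toy inputs: stops 1 and 2 are 10 m apart, stop 3 is far away
stops <- tibble(
  idx = 1:3, id = c(15, 15, 3),
  start = ymd_hms(c("2014-01-08 11:45:01", "2014-01-09 09:00:01",
                    "2014-01-08 11:50:01")),
  end   = ymd_hms(c("2014-01-08 12:33:01", "2014-01-09 09:30:01",
                    "2014-01-08 12:20:01")),
  location = "")
dist_m <- matrix(c(0, 10, 900, 10, 0, 900, 900, 900, 0), nrow = 3)
purchases <- tibble(
  timestamp  = ymd_hms("2014-01-08 12:29:00"),
  location   = "Frank's Fuel", last4ccnum = 3853)

pairs <- label_and_match(stops, purchases, dist_m,
                         seed_idx = 1, "Frank's Fuel")
```

On the toy data, only the first stop's window contains the purchase, so one candidate pair is returned.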

Manual work
### confident pair

### cars&consumption with pairs

### match new pairs

### 

Q4: Potential Relationships

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships

Q4-Fig1
1
[1] 1
Q4-Fig2
1
[1] 1
Q4-Fig3
1
[1] 1

Q5: Suspicious Activities

Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why.

Q5-Fig1
1
[1] 1
Q5-Fig2
1
[1] 1
Q5-Fig3
1
[1] 1

Trucks which were used for non-business purposes

day*hour

Now, let’s divide the time units from days into hours:

# cc_freq_day_hour <- as.data.frame(xtabs(~location+day+hour, data = cc))
# cc_freq_day_hour$hour <- as.numeric(levels(cc_freq_day_hour$hour))[cc_freq_day_hour$hour]
# p3 <- ggplot(cc_freq_day_hour,aes(x=hour,y=location))+
#   geom_tile(aes(fill=Freq),color="white")+
#   scale_fill_gradient(low = "#EFF7FB", high = "#0D2330")+
#   theme(panel.background = element_blank(),
#         axis.ticks = element_blank(),
#         axis.title = element_blank(),
#         legend.title=element_blank(),
#         plot.title = element_text(hjust=0.5))+
#   facet_wrap(~ day, ncol = 7)+
#   labs(title = "CC Frequency by hour of the day") 
# ggplotly(p3)

Conclusion

# knitr::kable(card_correspond_count,
#              caption = ) %>%
#   kableExtra::kable_paper("hover", full_width = F)
1
[1] 1